Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling

Link to the paper: [link]

1. Discrete-Time: Markov Chains and Time Reversal (Recap)

Consider a data distribution with positive density $p_{\text{data}}$, a positive prior density $p_{\text{prior}}$, and a Markov chain with initial density $p_0 = p_{\text{data}}$ on $\mathbb{R}^d$ evolving according to positive transition densities $p_{k+1|k}$ for $k \in \{ 0, \dots, N-1 \}$. By the Markov property, for any $x_{0:N} = \{ x_k \}_{k=0}^N \in \mathcal{X} = (\mathbb{R}^d)^{N+1}$, the joint density can be expressed as:

\begin{equation} p(x_{0:N}) = p_0(x_0) \prod_{k=0}^{N-1} p_{k+1 | k}(x_{k+1} | x_k) \end{equation}

The joint density also admits the backward decomposition:

\begin{equation} p(x_{0:N}) = p_N(x_N) \prod_{k=0}^{N-1} p_{k|k+1}(x_k | x_{k+1}) \end{equation}

where $p_k(x_k) = \int p_{k|k-1}(x_k | x_{k-1})\, p_{k-1}(x_{k-1})\, dx_{k-1}$ is the marginal density at step $k \geq 1$.

Why is the backward decomposition correct?

Proof by induction:

When $N = 1$, the claim is immediate: $p(x_{0:1}) = p_0(x_0)\, p_{1|0}(x_1 | x_0) = p_1(x_1)\, p_{0|1}(x_0 | x_1)$ by Bayes' rule.

Assume that the equation holds for $N = M$, $M \in \mathbb{N}_+$; then we prove it also holds for $N = M+1$.

\begin{align*} p(x_{0:M+1}) &= p(x_{0:M}) p(x_{M+1} | x_{0:M}) \\ &= \left[ p_M(x_M) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \right] p(x_{M+1} | x_{0:M}) \quad \text{(by assumption)} \\ &= \left[ p_M(x_M) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \right] p_{M+1|M}(x_{M+1} | x_M) \quad \text{(by Markov Property)} \\ &= p_{M+1}(x_{M+1}) p_{M | M + 1}(x_M | x_{M+1}) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \quad \text{(by Bayes Rule)} \\ &= p_{M+1}(x_{M+1}) \prod_{k=0}^{M} p_{k|k+1}(x_k | x_{k+1}) \end{align*}

End of Proof

For the purpose of generative modeling, the transition densities are chosen such that $p_N(x_N) = \int p(x_{0:N})\, dx_{0:N-1} \approx p_{\text{prior}}(x_N)$ for large $N$, where $p_{\text{prior}}$ is an easy-to-sample prior density. To sample approximately from $p_{\text{data}}$, one may use ancestral sampling with Equation (2), i.e. first sample $X_N \sim p_{\text{prior}}$, followed by $X_k \sim p_{k|k+1} (\cdot | X_{k+1})$ for $k \in \{ N-1, \dots, 0 \}$.

Equation (2) cannot be simulated exactly, but may be approximated if we consider a forward transition density of the form:

\begin{equation} p_{k+1 | k} (x_{k+1} | x_k) = \mathcal{N} (x_{k+1}; x_k + \gamma_{k+1} f(x_k), 2 \gamma_{k+1} \bold{I}) \end{equation}

with drift $f : \mathbb{R}^d \rightarrow \mathbb{R}^d$ and stepsize $\gamma_{k+1} > 0$. The backward transitions in Equation (2) can then be approximated as follows:

\begin{equation} \begin{aligned} p_{k | k+1}(x_k | x_{k+1}) &= p_{k+1 | k}(x_{k+1} | x_k) \exp [\log p_k(x_k) - \log p_{k+1} (x_{k+1})] \\ &\approx \mathcal{N}(x_k;\ x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2 \gamma_{k+1} \nabla \log p_{k+1} (x_{k+1}),\ 2\gamma_{k+1} \bold{I}) \end{aligned} \end{equation}

using that $p_k \approx p_{k+1}$, a Taylor expansion of $\log p_{k+1}$ at $x_{k+1}$, and $f(x_k) \approx f(x_{k+1})$.
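Below is a minimal sketch of simulating the forward chain in Equation (3), assuming a toy linear drift $f(x) = -\alpha x$ and a constant step size; both choices are illustrative, not prescribed by the text.

```python
import numpy as np

def forward_chain(x0, alpha=1.0, gamma=0.01, N=2000, rng=None):
    """Simulate X_{k+1} = X_k + gamma * f(X_k) + sqrt(2 * gamma) * Z_{k+1} with f(x) = -alpha * x."""
    rng = np.random.default_rng() if rng is None else rng
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(N):
        x = xs[-1]
        z = rng.standard_normal(x.shape)
        xs.append(x - gamma * alpha * x + np.sqrt(2 * gamma) * z)
    return np.stack(xs)                      # shape (N + 1, d)

# With alpha > 0 and N * gamma large, the terminal samples are approximately
# distributed as N(0, I / alpha) (cf. the drift f(x) = -alpha x discussed in Section 2).
samples = forward_chain(np.ones(2))
print(samples[-1])
```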

How is the approximation derived?

First,

\begin{align*} p_{k | k+1} (x_k |x_{k+1}) &= p_{k+1 | k}(x_{k+1} | x_k) \frac{p_k(x_k)}{p_{k+1}(x_{k+1})} \\ &= p_{k+1 | k}(x_{k+1} | x_k) \exp\left( \log \frac{p_k(x_k)}{p_{k+1}(x_{k+1})} \right) \\ &= p_{k+1 | k}(x_{k+1} | x_k) \exp [\log p_k(x_k) - \log p_{k+1} (x_{k+1})] \end{align*}

By a Taylor expansion of $\log p_{k+1}(x)$ at $x = x_{k+1}$:

\log p_{k+1}(x) \approx \log p_{k+1}(x_{k+1}) + \nabla \log p_{k+1}(x_{k+1})^T (x - x_{k+1})

Since $p_k \approx p_{k+1}$, we have $\log p_k(x_k) \approx \log p_{k+1}(x_k)$. Plugging the expansion above (evaluated at $x = x_k$) into this gives:

\log p_k(x_k) - \log p_{k+1}(x_{k+1}) \approx \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1})

Since $p_{k+1 | k} (x_{k+1} | x_k)$ is Gaussian, and using $f(x_k) \approx f(x_{k+1})$ in its mean, we have:

\begin{align*} &p_{k+1 | k}(x_{k+1} | x_k) \exp [\log p_k(x_k) - \log p_{k+1} (x_{k+1})] \\ \propto\ & \exp \left( -\frac{1}{4 \gamma_{k+1}} || x_{k+1} - x_k - \gamma_{k+1} f(x_{k+1}) ||^2 \right) \cdot \exp \left( \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1}) \right) \\ \propto\ & \exp \left( -\frac{1}{4 \gamma_{k+1}} \left[ ||x_k||^2 - 2(x_{k+1} - \gamma_{k+1} f(x_{k+1}))^T x_k - 4 \gamma_{k+1} \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1}) \right] \right) \\ \propto\ & \exp \left( -\frac{1}{4 \gamma_{k+1}} \left[ ||x_k||^2 - 2x_k^T \left( x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}) \right) \right] \right) \end{align*}

Since the exponent is quadratic in $x_k$, completing the square identifies this as a Gaussian density, so we conclude:

p_{k|k+1}(x_k | x_{k+1}) \approx \mathcal{N}(x_k ; x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2 \gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}), 2 \gamma_{k+1} \bold{I})

In practice, the approximation holds if $||x_{k+1} - x_k||$ is small, which is ensured by choosing $\gamma_{k+1}$ small enough. The score $\nabla \log p_{k+1}$ is not available in closed form, but an approximation can be obtained using denoising score-matching methods.

We assume that the conditional density $p_{k+1 | 0} (x_{k+1} | x_0)$ is available analytically (e.g. it is Gaussian, so its log-gradient is explicit). We can show that $\nabla \log p_{k+1}(x_{k+1}) = \mathbb{E}_{p_{0 | k+1}} \left[ \nabla_{x_{k+1}} \log p_{k+1 | 0} (x_{k+1} | X_0) \right]$.

Proof of the equality.

First, a fact about the gradient:

\begin{align*} \nabla \log f(x) &= \frac{\nabla f(x)}{f(x)} \\ f(x) \nabla \log f(x) &= \nabla f(x) \end{align*}

Since $p_{k+1} (x_{k+1}) = \int p_0(x_0)\, p_{k+1 | 0}(x_{k+1} |x_0)\, dx_0$, we have:

\begin{align*} \nabla \log p_{k+1}(x_{k+1}) &= \frac{\nabla_{x_{k+1}} p_{k+1}(x_{k+1})}{p_{k+1}(x_{k+1})} \\ &= \int \frac{p_0(x_0)}{p_{k+1}(x_{k+1})} \cdot \nabla_{x_{k+1}} p_{k+1 | 0}(x_{k+1} |x_0)\, dx_0 \\ &= \int \frac{p_0(x_0)\, p_{k+1|0}(x_{k+1} | x_0)}{p_{k+1}(x_{k+1})} \cdot \nabla_{x_{k+1}} \log p_{k+1 | 0}(x_{k+1} |x_0)\, dx_0 \\ &= \int p_{0 | k+1}(x_0 | x_{k+1}) \cdot \nabla_{x_{k+1}} \log p_{k+1 | 0}(x_{k+1} |x_0)\, dx_0 \\ &= \mathbb{E}_{p_{0 | k+1}} \left[ \nabla_{x_{k+1}} \log p_{k+1 | 0}(x_{k+1} |x_0)\right] \end{align*}
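For instance, if the transitions are such that $p_{k+1|0}(x_{k+1} | x_0) = \mathcal{N}(x_{k+1}; \mu_{k+1} x_0, \sigma_{k+1}^2 \bold{I})$ for some scalars $\mu_{k+1}, \sigma_{k+1}$ (an assumption, which holds e.g. for the linear drifts considered in Section 2), the conditional score inside the expectation is explicit:

\nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} | x_0) = -\frac{x_{k+1} - \mu_{k+1} x_0}{\sigma_{k+1}^2} = -\frac{z}{\sigma_{k+1}}, \qquad \text{where } x_{k+1} = \mu_{k+1} x_0 + \sigma_{k+1} z, \ z \sim \mathcal{N}(0, \bold{I}).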

Therefore we can formulate score estimation as a regression problem and use a flexible class of functions, e.g. neural networks, to parameterize an approximation $s_{\theta^*}(k, x_k) \approx \nabla \log p_k(x_k)$ such that:

\theta^* = \argmin_{\theta} \sum_{k=1}^N \mathbb{E}_{p_{0, k}} [||s_{\theta}(k, X_k) - \nabla_{x_k}\log p_{k|0}(X_k | X_0)||^2]

where $p_{0, k} = p_0(x_0)\, p_{k|0}(x_k | x_0)$. In practice, this is done by drawing a sample $x_0$ from the dataset and using $p_{k|0}$ to obtain the corresponding $x_k$.

If $p_{k|0}$ is not available, we instead use $\theta^* = \argmin_{\theta} \sum_{k=1}^N \mathbb{E}_{p_{k-1, k}}[||s_{\theta}(k, X_k) - \nabla_{x_k} \log p_{k | k-1}(X_k | X_{k-1}) ||^2]$.
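Here is a hedged sketch of how the first objective can be implemented, assuming linear-Gaussian transitions $p_{k|0}(x_k | x_0) = \mathcal{N}(x_k; \mu_k x_0, \sigma_k^2 \bold{I})$ (so the regression target is the closed form given earlier) and a small MLP for $s_\theta$; the network, noise schedule, and hyperparameters are placeholders rather than the paper's choices.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Small MLP s_theta(k, x); the (normalized) step index is appended as an extra input."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, k, x):
        return self.net(torch.cat([x, k], dim=-1))

def dsm_loss(model, x0, mu, sigma, N):
    """Monte Carlo estimate of E || s_theta(k, X_k) - grad_{x_k} log p_{k|0}(X_k | X_0) ||^2."""
    k = torch.randint(1, N + 1, (x0.shape[0], 1))   # one random step per example
    z = torch.randn_like(x0)
    xk = mu[k - 1] * x0 + sigma[k - 1] * z          # sample X_k | X_0 (assumed Gaussian)
    target = -z / sigma[k - 1]                      # closed-form conditional score
    pred = model(k.float() / N, xk)
    return ((pred - target) ** 2).sum(dim=-1).mean()

# Usage sketch with stand-in data and an arbitrary (hypothetical) schedule.
model = ScoreNet(dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mu = torch.linspace(1.0, 0.1, 50)                   # placeholder mean coefficients mu_k
sigma = torch.linspace(0.1, 1.0, 50)                # placeholder noise scales sigma_k
x0 = torch.randn(64, 2)                             # stand-in for a batch from the dataset
loss = dsm_loss(model, x0, mu, sigma, N=50)
loss.backward()
opt.step()
```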

In summary, score-based generative modeling first estimates the score function $s_{\theta^*}$ from noisy data, and then samples $X_0$ by drawing $X_N \sim p_{\text{prior}}$ and applying ancestral sampling with the approximation in Equation (4), i.e.

\begin{equation} X_k = X_{k+1} - \gamma_{k+1} f(X_{k+1}) + 2 \gamma_{k+1} s_{\theta^*}(k+1, X_{k+1}) + \sqrt{2 \gamma_{k+1}} Z_{k+1} \end{equation}

where $Z_{k+1} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \bold{I})$.
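A minimal NumPy sketch of the sampler in Equation (5), with an arbitrary callable `score(k, x)` standing in for $s_{\theta^*}$ and the toy drift $f(x) = -x$ (both assumptions for illustration). The sanity check at the end uses the fact that for $p_{\text{data}} = \mathcal{N}(0, \bold{I})$ and this drift the forward chain is approximately stationary, so the true score is roughly $-x$ at every step.

```python
import numpy as np

def ancestral_sample(score, x_N, gammas, f=lambda x: -x, rng=None):
    """Run X_k = X_{k+1} - g * f(X_{k+1}) + 2 * g * score(k+1, X_{k+1}) + sqrt(2 * g) * Z, g = gamma_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x_N, dtype=float)
    for k in reversed(range(len(gammas))):          # k = N-1, ..., 0; gammas[k] plays gamma_{k+1}
        g = gammas[k]
        z = rng.standard_normal(x.shape)
        x = x - g * f(x) + 2 * g * score(k + 1, x) + np.sqrt(2 * g) * z
    return x

# Sanity check under the assumed toy setup: samples should stay roughly N(0, I).
x_N = np.random.default_rng(0).standard_normal((1000, 2))
x_0 = ancestral_sample(lambda k, x: -x, x_N, gammas=[0.01] * 200)
print(x_0.mean(axis=0), x_0.var(axis=0))            # close to 0 and 1, respectively
```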

2. Continuous-Time: SDEs, Reverse-Time SDEs, and Theoretical Results

The Markov chain with kernel in Equation (3) corresponds to an Euler-Maruyama discretization of $(\bold{X}_t)_{t \in [0, T]}$, solving the following SDE:

\begin{equation} d \bold{X}_t = f(\bold{X}_t) dt + \sqrt{2} d \bold{B}_t, \qquad \bold{X}_0 \sim p_0 = p_{\text{data}} \end{equation}

where $(\bold{B}_t)_{t \in [0, T]}$ is a Brownian motion and $f: \mathbb{R}^d \rightarrow \mathbb{R}^d$ is regular enough so that solutions exist.

Question on Equation (6)

Equation (3) is recovered from Equation (6) by taking a single Euler-Maruyama step of size $\gamma_{k+1}$; the step size plays the role of the time increment $dt$, not of a coefficient in the SDE itself:

X_{k+1} = X_k + \gamma_{k+1} f(X_k) + \sqrt{2 \gamma_{k+1}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \bold{I})

so that $X_{k+1} \mid X_k \sim \mathcal{N}(X_k + \gamma_{k+1} f(X_k),\ 2 \gamma_{k+1} \bold{I})$, which is exactly the transition density in Equation (3).

Under some conditions on $f$, the reverse-time process $(\bold{Y}_t)_{t \in [0, T]} = (\bold{X}_{T - t})_{t \in [0, T]}$ satisfies

\begin{equation} \text{d} \bold{Y}_t = \{ -f(\bold{Y}_t) + 2 \nabla \log p_{T-t}(\bold{Y}_t) \} \text{d} t + \sqrt{2} \text{d} \bold{B}_t \end{equation}

with initialization $\bold{Y}_0 \sim p_T$, where $p_t$ denotes the marginal density of $\bold{X}_t$.

Another notation in Yang Song’s paper.

In that paper, the reverse-time SDE is written as:

\text{d} \bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})] \text{d} t + g(t) \text{d} \bar{\bold{w}}

where time flows backwards from $T$ to $0$, with initialization $\bold{x}_T \sim p_T$.

The two notations are equivalent; the direction of time is simply reversed, with $\bold{f}(\bold{x}, t) = f(\bold{x})$ and $g(t) = \sqrt{2}$ here.

The reverse-time Markov chain $\{ Y_k \}_{k=0}^N$ associated with Equation (5) corresponds to an Euler-Maruyama discretization of Equation (7), where the score function is approximated by $s_{\theta^*}(t, x)$.

Let's consider $f(x) = -\alpha x$ for $\alpha \geq 0$. This framework includes the one of Song and Ermon (2019) ($\alpha = 0$, $p_{\text{prior}}(x) = \mathcal{N}(x; 0, 2T \bold{I})$), for which $(\bold{X}_t)_{t\in[0,T]}$ is simply a Brownian motion. It also includes Ho et al. (2020) ($\alpha > 0$, $p_{\text{prior}}(x) = \mathcal{N}(x; 0, \bold{I} / \alpha)$), for which it is an Ornstein-Uhlenbeck process.
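As a quick check of these two priors (standard facts about linear SDEs, added here for completeness rather than taken from the text), the transition density of Equation (6) with $f(x) = -\alpha x$ is Gaussian:

\begin{align*} \alpha > 0 &: \quad p_{t|0}(x_t | x_0) = \mathcal{N}\left(x_t;\ e^{-\alpha t} x_0,\ \tfrac{1}{\alpha}\left(1 - e^{-2\alpha t}\right) \bold{I}\right) \longrightarrow \mathcal{N}(0, \bold{I}/\alpha) \ \text{as } t \to \infty, \\ \alpha = 0 &: \quad p_{t|0}(x_t | x_0) = \mathcal{N}(x_t;\ x_0,\ 2t\, \bold{I}) \approx \mathcal{N}(0, 2T \bold{I}) \ \text{for large } T. \end{align*}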

3. General SGM and links with existing works

Appendix C.3

General SGM Algorithm

Consider the forward process

\text{d} \bold{X}_t = f_t(\bold{X}_t) \text{d}t + \sqrt{2} \text{d} \bold{B}_t

The discretization gives:

X_{k+1} = X_k + \gamma_{k+1} f_k(X_k) + \sqrt{2 \gamma_{k+1}} Z_{k+1}

In general, $p_{k|0}(x_k | x_0)$ is not a Gaussian density. However, we can show that for any $x \in \mathbb{R}^d$,

p_{k+1}(x) = (4 \pi \gamma_{k+1})^{-d / 2} \int_{\mathbb{R}^d} p_k(\tilde{x}) \exp [- ||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x}

with $\mathcal{T}_{k+1}(\tilde{x}) = \tilde{x} + \gamma_{k+1} f_k(\tilde{x})$. After some algebra (detailed below), we obtain:

\nabla \log p_{k+1} (x) = -(2 \gamma_{k+1})^{-1/2}\, \mathbb{E} [Z_{k+1} \mid X_{k+1} = x]

How is that equation derived?

Denote $\tilde{Z}_{k+1} = \sqrt{2 \gamma_{k+1}} Z_{k+1} \sim \mathcal{N}(0, 2 \gamma_{k+1} \bold{I})$. Then,

\begin{align*} p_{k+1}(x) &= \int p_k(\tilde{x})\, \mathcal{N}(x - \mathcal{T}_{k+1}(\tilde{x}); 0, 2 \gamma_{k+1} \bold{I}) \,\text{d} \tilde{x} \\ &= \int p_k(\tilde{x})\, (4\pi \gamma_{k+1})^{-d / 2} \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x} \end{align*}

Taking the gradient with respect to $x$ on both sides and exchanging the gradient and the integral, we get:

\begin{align*} \nabla p_{k+1}(x) &= \int p_k(\tilde{x})\, \frac{\mathcal{T}_{k+1}(\tilde{x}) - x}{2 \gamma_{k+1}}\, (4\pi \gamma_{k+1})^{-d / 2} \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x} \end{align*}

Since $\nabla p_{k+1}(x) = p_{k+1}(x) \nabla \log p_{k+1}(x)$, we derive:

\begin{align*} [2\gamma_{k+1}\, p_{k+1}(x)]\, \nabla \log p_{k+1}(x) &= \int p_k(\tilde{x})\, [\mathcal{T}_{k+1}(\tilde{x}) - x]\, (4\pi \gamma_{k+1})^{-d / 2} \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x} \\ 2 \gamma_{k+1} \nabla \log p_{k+1}(x) &= \frac{\int p_k(\tilde{x})\, [\mathcal{T}_{k+1}(\tilde{x}) - x] \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x}}{\int p_k(\tilde{x}) \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \,\text{d} \tilde{x}} \end{align*}

Now we inspect the conditional density $p_{k+1|k}$. Note that:

\begin{align*} p_{k+1|k}(x|\tilde{x}) &= (2\gamma_{k+1})^{-d/2}\, p_{Z_{k+1}}\!\left(\frac{x - \mathcal{T}_{k+1}(\tilde{x})}{\sqrt{2 \gamma_{k+1}}}\right)\\ &= (4\pi \gamma_{k+1})^{-d/2} \exp[- ||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \end{align*}

Then, by Bayes’ Formula,

\begin{align*} p_{k|k+1}(\tilde{x} | x) &= \frac{p_{k+1|k}(x|\tilde{x})\, p_k(\tilde{x})}{p_{k+1}(x)}\\ &= \frac{(4\pi \gamma_{k+1})^{-d/2}}{p_{k+1}(x)}\, p_k(\tilde{x}) \exp[- ||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})] \end{align*}

Therefore, the score function can be written as:

\begin{align*} \nabla \log p_{k+1}(x) &= \frac{1}{2\gamma_{k+1}} \cdot \frac{\int [\mathcal{T}_{k+1}(\tilde{x}) - x]\, p_k(\tilde{x}) \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})]\, (4\pi\gamma_{k+1})^{-d/2} / p_{k+1}(x) \,\text{d} \tilde{x}}{\int p_k(\tilde{x}) \exp [-||\mathcal{T}_{k+1}(\tilde{x}) - x||^2 / (4 \gamma_{k+1})]\, (4\pi\gamma_{k+1})^{-d/2} / p_{k+1}(x)\, \text{d} \tilde{x}} \\ &= \frac{1}{2\gamma_{k+1}} \int [\mathcal{T}_{k+1}(\tilde{x}) - x]\, p_{k|k+1}(\tilde{x} | x)\, \text{d} \tilde{x} \\ &= \frac{1}{2\gamma_{k+1}} \mathbb{E} [\mathcal{T}_{k+1}(X_k) - X_{k+1} \mid X_{k+1} = x] \\ &= \frac{1}{2\gamma_{k+1}} \mathbb{E} [-\sqrt{2\gamma_{k+1}}\, Z_{k+1} \mid X_{k+1} = x] \quad \text{(since } X_{k+1} = \mathcal{T}_{k+1}(X_k) + \sqrt{2\gamma_{k+1}}\, Z_{k+1} \text{)} \\ &= - (2 \gamma_{k+1})^{-\frac{1}{2}}\, \mathbb{E}[Z_{k+1} \mid X_{k+1} = x] \end{align*}
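A small numerical check of this identity, under assumed toy choices ($d = 1$, $p_k = \mathcal{N}(0, 1)$, $f_k(x) = -x$, a single step of size $\gamma$); everything stays Gaussian, so both sides are available in closed form.

```python
import numpy as np

gamma = 0.05
x = 0.7                                     # point at which the score is evaluated

# One forward step: X_{k+1} = (1 - gamma) * X_k + sqrt(2 * gamma) * Z_{k+1}.
var_next = (1 - gamma) ** 2 + 2 * gamma     # p_{k+1} = N(0, var_next)
score_exact = -x / var_next                 # grad log p_{k+1}(x) for a centered Gaussian

# (X_{k+1}, Z_{k+1}) is jointly Gaussian with Cov(X_{k+1}, Z_{k+1}) = sqrt(2 * gamma),
# so E[Z_{k+1} | X_{k+1} = x] = sqrt(2 * gamma) * x / var_next.
cond_mean_z = np.sqrt(2 * gamma) * x / var_next
score_via_identity = -cond_mean_z / np.sqrt(2 * gamma)

print(score_exact, score_via_identity)      # the two values coincide
```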